Classifying English Documents by National Dialect

نویسندگان

  • Marco Lui
  • Paul Cook
چکیده

We investigate national dialect identification, the task of classifying English documents according to their country of origin. We use corpora of known national origin as a proxy for national dialect. In order to identify general (as opposed to corpus-specific) characteristics of national dialects of English, we make use of a variety of corpora of different sources, with inter-corpus variation in length, topic and register. The central intuition is that features that are predictive of national origin across different data sources are features that characterize a national dialect. We examine a number of classification approaches motivated by different areas of research, and evaluate the performance of each method across 3 national dialects: Australian, British, and Canadian English. Our results demonstrate that there are lexical and syntactic characteristics of each national dialect that are consistent across data sources.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Classifying and Clustering Dialects of North American English

This paper presents the results of experiments in which machine learning techniques were applied to the problem of determining regional dialect boundaries. Specifically, decision trees classification and k-means clustering were applied to a corpus of phonetic measurements taken from a large survey of North American English vowels. Pairwise classification and clustering experiments were done for...

متن کامل

Do Web Corpora from Top-Level Domains Represent National Varieties of English?

In this study we consider the problem of determining whether an English corpus constructed from a given national top-level domain (e.g., .uk, .ca) represents the national dialect of English of the corresponding country (e.g., British English, Canadian English). We build English corpora from two top-level domains (.uk and .ca, corresponding to the United Kingdom and Canada, respectively) that co...

متن کامل

Towards the Design of the Australian National Corpus

Corpora are becoming more and more important as a research tool for linguists as they are large collections of authentic text. However, not every researcher has the time and resources to compile their own corpus. Large corpora in the world such as the BNC, the ANC or the International Corpus of English (ICE) have been widely used for research on the English language in general or an English dia...

متن کامل

An Analysis of Ministry of Education’s Strategic Plans Based on Favorable Components of English Language Teaching Using Shannon’s Entropy

The present research aims to analyze the content of Ministry of Education’s strategic plans (the Fundamental Reform Document of Education, the Comprehensive National Scientific Plan and the National Curriculum Document) based on Shannon's entropy regarding the favorable components of teaching English. The contents of the Fundamental Reform Document of Education, the Comprehensive National Scien...

متن کامل

The use of shibboleth words for automatically classifying speakers by dialect

Real-world applications using speech recognition must perform well over a range of dialects. Di erences in dialect between the speakers in the training database and the target users often leads to degraded recognition performance. For the BBN Hark Hidden Markov Model (HMM) based system, we have already developed a reasonably e ective technique [1] for dealing with multiple US dialects. The solu...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013